
[Zerobubble] Merge Main. #6107

Merged
merged 190 commits into hpcaitech:feature/zerobubble on Nov 5, 2024

Conversation

@duanjunwen (Member) commented on Nov 1, 2024

📌 Checklist before creating the PR

  • I have created an issue for this PR for traceability
  • The title follows the standard format: [doc/gemini/tensor/...]: A concise description
  • I have added relevant tags if possible for us to better distinguish different PRs
  • I have installed pre-commit: pip install pre-commit && pre-commit install

🚨 Issue number

Link this PR to your issue with words like fixed to automatically close the linked issue upon merge

e.g. fixed #1234, closed #1234, resolved #1234
#6037

📝 What does this PR do?

Summarize your work here. If you have any plots/diagrams/screenshots/tables, please attach them here.
This PR adds full HybridParallel support (in #6083, we only supported hybrid parallelism combined with tensor parallelism). The llama and mixtral model policies will also support ZeroBubble.
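
As a rough usage sketch of what "full HybridParallel support" could look like from the user side, the snippet below combines tensor parallelism, pipeline parallelism, and a zero-bubble schedule through `HybridParallelPlugin`. The flag name `pp_style="zbv"` and the exact argument set are assumptions about the merged API, not a confirmed interface; please consult the plugin docstring in the merged code.

```python
import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import HybridParallelPlugin

colossalai.launch_from_torch()  # distributed env set up via torchrun; older versions require a config dict

# Hypothetical configuration: tp + pp with a zero-bubble pipeline schedule.
# `pp_style="zbv"` is an assumed switch name and may differ in the merged plugin.
plugin = HybridParallelPlugin(
    tp_size=2,
    pp_size=2,
    num_microbatches=4,
    zero_stage=1,
    precision="bf16",
    pp_style="zbv",
)
booster = Booster(plugin=plugin)
# model, optimizer, criterion, dataloader, lr_scheduler = booster.boost(
#     model, optimizer, criterion, dataloader, lr_scheduler
# )
```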

💥 Checklist before requesting a review

  • I have linked my PR to an issue (instruction)
  • My issue clearly describes the problem/feature/proposal, with diagrams/charts/table/code if possible
  • I have performed a self-review of my code
  • I have added thorough tests.
  • I have added docstrings for all the functions/methods I implemented

⭐️ Do you enjoy contributing to Colossal-AI?

  • 🌝 Yes, I do.
  • 🌚 No, I don't.

Tell us more if you don't enjoy contributing to Colossal-AI.

BurkeHulk and others added 30 commits July 1, 2024 13:44
cast_to_fp8, cast_from_fp8, all_reduce_fp8
…p8_comm

# Conflicts:
#	colossalai/quantization/fp8.py
[Feature] FP8 communication in ShardFormer
[Shardformer] Fix Shardformer FP8 communication training accuracy degradation
[fp8] add fp8 comm for low level zero
* add llama shardformer fp8

* Llama Shardformer Parity

* fix typo

* fix all reduce

* fix pytest failure

* fix reduce op and move function to fp8.py

* fix typo
* add SimPO

* fix dataloader

* remove debug code

* add orpo

* fix style

* fix colossalai, transformers version

* fix colossalai, transformers version

* fix colossalai, transformers version

* fix torch colossalai version

* update transformers version

* [shardformer] DeepseekMoE support (hpcaitech#5871)

* [Feature] deepseek moe expert parallel implement

* [misc] fix typo, remove redundant file (hpcaitech#5867)

* [misc] fix typo

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* [Feature] deepseek support & unit test

* [misc] remove debug code & useless print

* [misc] fix typos (hpcaitech#5872)

* [Feature] remove modeling file, use auto config. (hpcaitech#5884)

* [misc] fix typos

* [Feature] deepseek support via auto model, remove modeling file

* [misc] delete useless file

* [misc] fix typos

* [Deepseek] remove redundant code (hpcaitech#5888)

* [misc] fix typos

* [Feature] deepseek support via auto model, remove modeling file

* [misc] delete useless file

* [misc] fix typos

* [misc] remove redundant code

* [Feature/deepseek] resolve comment. (hpcaitech#5889)

* [misc] fix typos

* [Feature] deepseek support via auto model, remove modeling file

* [misc] delete useless file

* [misc] fix typos

* [misc] remove redundant code

* [misc] mv module replacement into if branch

* [misc] add some warning message and modify some code in unit test

* [misc] fix typos

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* [Hotfix] Fix CUDA_DEVICE_MAX_CONNECTIONS for comm overlap

Co-authored-by: Edenzzzz <[email protected]>

* [Feat] Diffusion Model(PixArtAlpha/StableDiffusion3) Support (hpcaitech#5838)

* Diffusion Model Inference support

* Stable Diffusion 3 Support

* pixartalpha support

* [HotFix] CI,import,requirements-test for hpcaitech#5838 (hpcaitech#5892)

* [Hot Fix] CI,import,requirements-test

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* [Feature] Enable PP + SP for llama (hpcaitech#5868)

* fix cross-PP-stage position id length diff bug

* fix typo

* fix typo

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* use one cross entropy func for all shardformer models

---------

Co-authored-by: Edenzzzz <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* [ShardFormer] Add Ulysses Sequence Parallelism support for Command-R, Qwen2 and ChatGLM (hpcaitech#5897)

* add benchmark for sft, dpo, simpo, orpo. Add benchmarking result. Support lora with gradient checkpoint

* fix style

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix eval

* hotfix citation

* [zero] support all-gather overlap (hpcaitech#5898)

* [zero] support all-gather overlap

* [zero] add overlap all-gather flag

* [misc] fix typo

* [zero] update api

* fix orpo cross entropy loss

* [Auto Parallel]: Speed up intra-op plan generation by 44% (hpcaitech#5446)

* Remove unnecessary calls to deepcopy

* Build DimSpec's difference dict only once

This change considerably speeds up the construction of DimSpec objects: the difference_dict is identical for every DimSpec, so a single shared copy is enough.
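
A minimal sketch of that caching pattern, using a stripped-down DimSpec purely for illustration (the real class in the auto-parallel code is more involved):

```python
class DimSpec:
    """Illustrative sketch: the difference table is identical for every DimSpec,
    so it is built once and shared as a class attribute instead of per instance."""

    _difference_dict = None  # shared, lazily-built cache

    def __init__(self, shard_list):
        self.shard_list = tuple(shard_list)
        if DimSpec._difference_dict is None:
            # stand-in for the expensive construction that used to run per object
            DimSpec._difference_dict = DimSpec._build_difference_dict()

    @staticmethod
    def _build_difference_dict():
        return {("R", "S0"): 1, ("S0", "R"): 1, ("R", "R"): 0}

    @property
    def difference_dict(self):
        return DimSpec._difference_dict
```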

* Fix documentation of DimSpec's difference method

* [ShardFormer] fix qwen2 sp (hpcaitech#5903)

* [compatibility] support torch 2.2 (hpcaitech#5875)

* Support Pytorch 2.2.2

* keep build_on_pr file and update .compatibility

* fix object_to_tensor usage when torch>=2.3.0 (hpcaitech#5820)

* [misc] support torch2.3 (hpcaitech#5893)

* [misc] support torch2.3

* [devops] update compatibility ci

* [devops] update compatibility ci

* [devops] add debug

* [devops] add debug

* [devops] add debug

* [devops] add debug

* [devops] remove debug

* [devops] remove debug

* [release] update version (hpcaitech#5912)

* [plugin] support all-gather overlap for hybrid parallel (hpcaitech#5919)

* [plugin] fixed all-gather overlap support for hybrid parallel

* add kto

* fix style, add kto data sample

* [Examples] Add lazy init to OPT and GPT examples (hpcaitech#5924)

Co-authored-by: Edenzzzz <[email protected]>

* [ColossalChat] Hotfix for ColossalChat (hpcaitech#5910)

* add ignore and tiny llama

* fix path issue

* run style

* fix issue

* update bash

* add ignore and tiny llama

* fix path issue

* run style

* fix issue

* update bash

* fix ddp issue

* add Qwen 1.5 32B

* refactor tokenization

* [FIX BUG] UnboundLocalError: cannot access local variable 'default_conversation' where it is not associated with a value (hpcaitech#5931)

* Set a default value for 'default_conversation' so the variable is always bound, avoiding `UnboundLocalError: cannot access local variable 'default_conversation' where it is not associated with a value`.
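
The pattern behind the fix, shown with hypothetical helper names for illustration only: bind the variable to a fallback before the branch that may skip the assignment.

```python
from typing import Optional

def pick_conversation_template(custom_path: Optional[str] = None) -> str:
    """Hypothetical helper illustrating the fix."""
    default_conversation = "plain"  # fallback binding added by the fix
    if custom_path is not None:
        default_conversation = open(custom_path).read().strip()
    # Without the fallback binding above, returning here would raise
    # UnboundLocalError whenever custom_path is None.
    return default_conversation

print(pick_conversation_template())  # -> plain
```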

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* fix test data

* refactor evaluation

* remove real data path

* remove real data path

* Add n_fused as an input from native_module (hpcaitech#5894)

* [FIX BUG] convert env param to int in (hpcaitech#5934)

* [Hotfix] Fix ZeRO typo hpcaitech#5936

Co-authored-by: Edenzzzz <[email protected]>

* [Feature] Add a switch to control whether the model checkpoint needs to be saved after each epoch ends (hpcaitech#5941)

* Add a switch to control whether the model checkpoint needs to be saved after each epoch ends

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* fix style

* fix style

* fix style

* [shardformer] hotfix attn mask (hpcaitech#5945)

* [shardformer] hotfix attn mask (hpcaitech#5947)

* [Feat] Distrifusion Acceleration Support for Diffusion Inference (hpcaitech#5895)

* Distrifusion Support source

* comp comm overlap optimization

* sd3 benchmark

* pixart distrifusion bug fix

* sd3 bug fix and benchmark

* generation bug fix

* naming fix

* add docstring, fix counter and shape error

* add reference

* readme and requirement

* [zero] hotfix update master params (hpcaitech#5951)

* [release] update version (hpcaitech#5952)

* [Chat] Fix lora (hpcaitech#5946)

* fix merging

* remove filepath

* fix style

* Update README.md (hpcaitech#5958)

* [hotfix] Remove unused plan section (hpcaitech#5957)

* remove readme

* fix readme

* update

* [test] add mixtral for sequence classification

* [test] add mixtral transformer test

* [moe] fix plugin

* [test] mixtral pp shard test

* [chore] handle non member group

* [zero] solve hang

* [test] pass mixtral shardformer test

* [moe] implement transit between non moe tp and ep

* [zero] solve hang

* [misc] solve booster hang by renaming the variable

* solve hang when parallel mode = pp + dp

* [moe] implement submesh initialization

* [moe] add mixtral dp grad scaling when not all experts are activated

* [chore] manually revert unintended commit

* [chore] trivial fix

* [chore] arg pass & remove drop token

* [test] add mixtral modelling test

* [moe] implement tp

* [moe] test deepseek

* [moe] clean legacy code

* [Feature] MoE Ulysses Support (hpcaitech#5918)

* moe sp support

* moe sp bug solve

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* [chore] minor fix

* [moe] init moe plugin comm setting with sp

* moe sp + ep bug fix

* [moe] finalize test (no pp)

* [moe] full test for deepseek and mixtral (pp + sp to fix)

* [chore] minor fix after rebase

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [chore] solve moe ckpt test failure and some other arg pass failure

* [moe] remove ops

* [test] fix test: test_zero1_2

* [bug] fix: somehow logger hangs the program

* [moe] deepseek moe sp support

* [test] add check

* [deepseek] replace attn (a workaround for bug in transformers)

* [misc] skip redundant test

* [misc] remove debug/print code

* [moe] refactor mesh assignment

* Revert "[moe] implement submesh initialization"

This reverts commit 2f9bce6.

* [chore] change moe_pg_mesh to private

* [misc] remove incompatible test config

* [misc] fix ci failure: change default value to false in moe plugin

* [misc] remove useless condition

* [chore] docstring

* [moe] remove force_overlap_comm flag and add warning instead

* [doc] add MoeHybridParallelPlugin docstring

* [moe] solve dp axis issue

* [chore] remove redundant test case, print string & reduce test tokens

* [feat] Dist Loader for Eval (hpcaitech#5950)

* support auto distributed data loader

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* support auto distributed data loader

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix tp error

* remove unused parameters

* remove unused

* update inference

* update docs

* update inference

---------

Co-authored-by: Michelle <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* [lora] lora support hybrid parallel plugin (hpcaitech#5956)

* lora support hybrid plugin

* fix

* fix

* fix

* fix

* fp8 operators for compressed communication

cast_to_fp8, cast_from_fp8, all_reduce_fp8
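
For context, a minimal sketch of what per-tensor-scaled FP8 casting along these lines can look like. The helper names mirror the ops listed above, but this is illustrative only; the actual implementation lives in colossalai/quantization/fp8.py and handles more cases (dtypes, groups, the all-reduce path).

```python
import torch

def cast_to_fp8(x: torch.Tensor, fp8_dtype=torch.float8_e4m3fn):
    """Quantize x to FP8 with a single per-tensor scale (requires torch >= 2.1)."""
    fp8_max = torch.finfo(fp8_dtype).max
    scale = fp8_max / x.abs().max().clamp(min=1e-12)  # map the largest value onto the FP8 range
    return (x * scale).to(fp8_dtype), scale

def cast_from_fp8(x_fp8: torch.Tensor, scale: torch.Tensor, dtype=torch.float32):
    """Dequantize back to a higher-precision dtype."""
    return x_fp8.to(dtype) / scale
```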

* fix scaling algorithm in FP8 casting

* support fp8 communication in pipeline parallelism

* add fp8_communication flag in the script

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix typo

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* shardformer fp8

* fix rebase

* remove all to all

* fix shardformer fp8 communication training degradation

* [fp8] support all-gather flat tensor (hpcaitech#5932)

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

* Update low_level_optim.py

---------

Co-authored-by: YeAnbang <[email protected]>
Co-authored-by: Haze188 <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Edenzzzz <[email protected]>
Co-authored-by: Edenzzzz <[email protected]>
Co-authored-by: Runyu Lu <[email protected]>
Co-authored-by: Guangyao Zhang <[email protected]>
Co-authored-by: YeAnbang <[email protected]>
Co-authored-by: Hongxin Liu <[email protected]>
Co-authored-by: Stephan Kö <[email protected]>
Co-authored-by: アマデウス <[email protected]>
Co-authored-by: Tong Li <[email protected]>
Co-authored-by: zhurunhua <[email protected]>
Co-authored-by: Insu Jang <[email protected]>
Co-authored-by: Gao, Ruiyuan <[email protected]>
Co-authored-by: hxwang <[email protected]>
Co-authored-by: Michelle <[email protected]>
Co-authored-by: Wang Binluo <[email protected]>
Co-authored-by: HangXu <[email protected]>
* support all2all fp8

* fix

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

* fix

* fix

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* [fp8] add fp8 linear

* [test] fix fp8 linear test condition

* [test] fix fp8 linear test condition

* [test] fix fp8 linear test condition
* [fp8] support fp8 amp for hybrid parallel plugin

* [test] add fp8 hook test

* [fp8] fix fp8 linear compatibility
…5928)

* support fp8_communication in the Torch DDP grad comm, FSDP grad comm, and FSDP params comm
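
As a conceptual sketch of hooking FP8 compression into DDP gradient communication via `register_comm_hook` (the actual commits use Colossal-AI's own fp8 utilities, and details such as all-to-all vs. all-gather and scale handling differ; the quantization below is illustrative):

```python
import torch
import torch.distributed as dist

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # requires torch >= 2.1

def fp8_compress_hook(state, bucket: dist.GradBucket) -> torch.futures.Future[torch.Tensor]:
    """Quantize the gradient bucket to fp8, exchange it via all-gather
    (NCCL has no fp8 all-reduce), then dequantize and average locally."""
    group = state if state is not None else dist.group.WORLD
    world_size = dist.get_world_size(group)
    grad = bucket.buffer()

    # per-bucket scale so the largest magnitude maps onto the fp8 range
    scale = (FP8_MAX / grad.abs().max().clamp(min=1e-12)).reshape(1)
    fp8_grad = (grad * scale).to(torch.float8_e4m3fn)

    # exchange raw bytes and scales
    gathered = [torch.empty_like(fp8_grad.view(torch.uint8)) for _ in range(world_size)]
    dist.all_gather(gathered, fp8_grad.view(torch.uint8), group=group)
    scales = [torch.empty_like(scale) for _ in range(world_size)]
    dist.all_gather(scales, scale, group=group)

    # dequantize, sum and average
    out = torch.zeros_like(grad)
    for buf, s in zip(gathered, scales):
        out += buf.view(torch.float8_e4m3fn).to(grad.dtype) / s
    out /= world_size

    fut = torch.futures.Future()
    fut.set_result(out)
    return fut

# ddp_model.register_comm_hook(state=None, hook=fp8_compress_hook)
```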

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* implement communication hook for FSDP params all-gather

* added unit test for fp8 operators

* support fp8 communication in GeminiPlugin

* update training scripts to support fsdp and fp8 communication

* fixed some minor bugs observed in unit test

* add all_gather_into_tensor_flat_fp8

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add skip the test if torch < 2.2.0

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add skip the test if torch < 2.2.0

* add skip the test if torch < 2.2.0

* add fp8_comm flag

* rebase latest fp8 operators

* rebase latest fp8 operators

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* [fp8] refactor hook

* [fp8] support gemini plugin

* [example] add fp8 option for llama benchmark
* [fp8] use torch compile (torch >= 2.4.0)

* [fp8] set use_fast_accum in linear

* [chore] formal version check

* [chore] fix sig
BurkeHulk and others added 20 commits October 21, 2024 13:55
* [doc] sora solution news

* [doc] sora solution news
…ime too long in Recv Bwd; benchmark for llama + Hybrid(tp+pp);
* add reasoner

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update code

* delete llama

* update prompts

* update readme

* update readme

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
… use weightGradStore and not use WeightGradStore
updates:
- [github.com/psf/black-pre-commit-mirror: 24.8.0 → 24.10.0](psf/black-pre-commit-mirror@24.8.0...24.10.0)
- [github.com/pre-commit/mirrors-clang-format: v18.1.8 → v19.1.2](pre-commit/mirrors-clang-format@v18.1.8...v19.1.2)
- [github.com/pre-commit/pre-commit-hooks: v4.6.0 → v5.0.0](pre-commit/pre-commit-hooks@v4.6.0...v5.0.0)

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
@duanjunwen duanjunwen requested a review from a team as a code owner November 1, 2024 03:16
@duanjunwen duanjunwen changed the title from "[ZeroBubble] Support Fully HybridParallel Plugin." to "[Zerobubble] Merge Main." on Nov 1, 2024
@duanjunwen duanjunwen merged commit 37b23e3 into hpcaitech:feature/zerobubble Nov 5, 2024
19 of 20 checks passed